Adding GOMEMLIMIT to compactor libsonnet #1758

joe-elliott merged 5 commits into grafana:main from ie-pham:jennie/compactor-jsonnet
Conversation
```diff
  $.util.readinessProbe +
- (if $._config.variables_expansion then container.withArgsMixin(['--config.expand-env=true']) else {}),
+ (if $._config.variables_expansion then container.withArgsMixin(['--config.expand-env=true']) else {}) +
+ container.withEnvMixin([envVar.new('GOMEMLIMIT', $._config.compactor.resources.limits.memory + 'B')]),
```
If someone were to set `_config.compactor.resources.limits.memory` to `null`, I think this would break. An `if` statement like the one above should be able to check for that before adding the env var.
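A minimal jsonnet sketch of the suggested guard, assuming the same `$._config` layout and the `container`/`envVar` bindings from the diff above:

```jsonnet
// Sketch: only add GOMEMLIMIT when a memory limit is actually configured,
// so a null limit doesn't break manifest rendering.
(
  if $._config.compactor.resources.limits.memory != null then
    container.withEnvMixin([
      envVar.new('GOMEMLIMIT', $._config.compactor.resources.limits.memory + 'B'),
    ])
  else {}
),
```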
I'm curious, what is your experience with GOMEMLIMIT compared to a memory ballast?
I haven't had a chance to reevaluate the impact of ballast in a modern Go application, but I do think that GOMEMLIMIT and ballast are designed to serve different purposes. Ballast signals to the GC that you expect the application to consume a large amount of memory; it prevents the GC from overworking during the startup phase of your application. GOMEMLIMIT is meant to be a hard limit at which you are OK with expending a large amount of CPU on GC, to the detriment of the application's main focus.

This works quite well for compactors: we'd much rather slow down a compactor than have it OOM and throw away the last few minutes of work. For other components it's not so obvious what we'd prefer. If a querier takes on too much work, is it better to OOM so the queries get retried, or to just answer them far slower? We've experimented with GOMEMLIMIT on ingesters and, as the allocated memory approaches GOMEMLIMIT, the ingester starts failing a larger and larger percentage of its requests.

If you do some experimentation with these parameters please report back! The goal is to use these tools to create a Tempo that is easy to operate and fulfills the operational expectations of its users.
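To make the compactor trade-off concrete, here is a hedged jsonnet sketch of pinning GOMEMLIMIT below the container memory limit so GC backpressure engages before the kernel OOM killer; the `tempo_compactor_container` field name, the 10Gi limit, and the ~90% headroom factor are illustrative assumptions, not Tempo defaults:

```jsonnet
// Sketch only: hardcoded values stand in for whatever $._config provides.
// Assumes the compactor container has a 10Gi memory limit.
{
  tempo_compactor_container+::
    container.withEnvMixin([
      // ~90% of the container limit, leaving headroom for non-heap memory
      // (goroutine stacks, mmapped files, fragmentation).
      envVar.new('GOMEMLIMIT', '9GiB'),
    ]),
}
```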
Thanks for your detailed answer @joe-elliott! Unfortunately I haven't had a chance to evaluate the parameters myself yet. From reading the blogs at https://weaviate.io/blog/2022/08/GOMEMLIMIT-a-Game-Changer-for-High-Memory-Applications.html and https://www.sobyte.net/post/2022-06/memory-ballast-gc-tuner/ I got the impression that GOMEMLIMIT makes the ballast obsolete, but I think you're right: there are two different scenarios in play, OOMing (even if memory is available) and the GC overworking. There's also
I did some experimentation with ballast and GOMEMLIMIT. I created identical (except for the parameters under test) deployments of Tempo v1.5 in microservices mode, and used a load generator to send synthetic traces.

When using 3 load generator replicas with 1000 configured virtual users, each sending traces every 1-5 seconds, I observed on average 3.15 GC collections/minute for the distributor and 6.56 collections/minute for the ingester. With a 1 GB ballast the numbers went down to 1.95 for the distributor and 2.85 for the ingester. In relative terms that's an improvement of 38% and 56% respectively, but in absolute numbers at most 4 fewer GC operations per minute is negligible, I'd say. A test with

I'm not 100% sure how to interpret the results. Maybe the load was too low (too few spans), or I focused on the wrong metric (I didn't look at the GC duration, as the GC should happen in the background per prometheus/client_golang#618 (comment)), or I made some error or wrong conclusion in the test. Maybe the results are also fine and it's only relevant in huge deployments :)

For reference, here's the entire test setup including the manifests and Grafana dashboard: https://github.com/andreasgerstmayr/tempo-perf-eval/tree/main/ballast
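A small sketch of the query behind those "collections/minute" numbers, written as a jsonnet panel-target fragment; the metric name assumes the standard Go collector from client_golang, and the `legendFormat` is a guess at the dashboard setup:

```jsonnet
// go_gc_duration_seconds_count increments once per completed GC cycle,
// so its per-second rate times 60 yields GC collections per minute.
{
  expr: 'rate(go_gc_duration_seconds_count[5m]) * 60',
  legendFormat: '{{pod}} GC/min',
}
```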
This analysis is excellent and makes sense to me. When setting

What about these results did you find surprising?
I thought I'd see a much higher improvement in terms of absolute numbers. Going from 3.15 GC collections per minute for the distributor and 6.56 collections per minute for the ingester down to 1.95 and 2.85 respectively is not a big difference in absolute numbers; I'd say it's negligible in this case. I guess the synthetic load wasn't enough, or Tempo's memory handling is more efficient compared to the application from Twitch (https://blog.twitch.tv/en/2019/04/10/go-memory-ballast-how-i-learnt-to-stop-worrying-and-love-the-heap/), which went from about 256 ops/min to about 4 ops/min. Probably Go's GC efficiency has also improved over the years.
What this PR does: Adds GOMEMLIMIT to the compactor libsonnet.
Which issue(s) this PR fixes:
Fixes #
Checklist
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]